ABSTRACT

Autism spectrum disorder (ASD) is a heterogeneous
neurodevelopmental disorder with a prevalence of 0.9-2.6%.
Twin studies have shown a heritability of 38-90%, indicating
strong genetic contributions. Yet it is unclear how many genes
have been associated with ASD and how strong the evidence is.
A comprehensive review and analysis of the literature and data
may give a clearer overall picture of autism genetics. We show
that as many as 2193 genes, 2806 SNPs/VNTRs, 4544 copy number
variations (CNVs) and 158 linkage regions have been associated
with ASD by GWAS, genome-wide CNV studies, linkage analyses,
low-scale genetic association studies, expression profiling and
other low-scale experimental studies. To evaluate the evidence,
we collected metadata about each study, including clinical and
demographic features, experimental design and statistical
significance, and used a scoring and ranking approach to select
a core data set of 434 high-confidence genes. These genes mapped
to pathways including neuroactive ligand-receptor interaction,
synaptic transmission and axon guidance. To better understand
the genes, we parsed over 30 databases to retrieve extensive data
about expression patterns, protein interactions, animal models
and pharmacogenetics. We constructed a MySQL-based online
database and share it with the broader autism research community
at http://autismkb.cbi.pku.edu.cn, supporting sophisticated
browsing and searching functionalities.

INTRODUCTION 

Autism spectrum disorder (ASD) is a heterogeneous
neurodevelopmental disorder characterized by impairments in
reciprocal social interaction and communication and the presence
of restricted, repetitive and stereotyped patterns of behavior,
interests and activities (1). ASD is an umbrella term for
Autistic Disorder, Asperger Syndrome and Pervasive Developmental
Disorder Not Otherwise Specified (PDD-NOS) (1). With an early
onset prior to age 3 and a prevalence as high as 0.9-2.6% (2,3),
ASD is one of the leading causes of childhood disability and
imposes serious suffering and burden on families and society (4).

Understanding the causes of ASD is critical for developing
better treatment. Twin studies have shown that the heritability
of ASD is as high as 38-90%, indicating strong contributions by
genetic factors as well as environmental factors (5,6). The
search for environmental factors has not yet led to convincing
major candidates, whereas the search for genes associated with
autism, although far from complete or conclusive, has been more
fruitful. The genes discovered so far can be roughly grouped
into two categories: 'syndromic autism related genes', or causal
genes underlying genetic disorders that cause autistic symptoms,
such as Fragile X Syndrome, Rett Syndrome, Tuberous Sclerosis
Complex and dozens of other disorders (7,8), and 'non-syndromic
autism related genes', most of which are susceptibility genes (9).
Many experimental methods have been used to identify associated
genes, including the earlier linkage analyses and low-scale
candidate gene association or experimental studies as well as
the more recent genome-wide association studies (GWAS),
genome-wide CNV studies and expression profiling.

With hundreds of studies published, especially the recent
genome-wide studies, and with next-generation sequencing
technologies providing even more power for further gene
discoveries (10), a new challenge has emerged: it has become
more and more difficult for an autism researcher to answer with
confidence how many genes have been associated with ASD, how
strong the evidence is, what features the genes have and what
pathways they involve. The amount of available literature and
data and the intrinsic complexity of autism genetics demand
bioinformatic data management and analysis. Three efforts have
been made so far by different groups to collect genes and
variations associated with ASD: AutDB (also known as SFARI Gene)
collected 219 genes (11,12), the Autism Genetic Database (AGD)
collected 226 genes and 743 CNVs (13) and the Autism Chromosome
Rearrangement Database (ACRD) collected 372 breakpoints and
other genomic features (14). However, they are far from a
comprehensive survey of autism genetics. To give a clearer
overall picture of autism genetics, we performed a comprehensive
review and analysis of published literature and data, described
below, resulting in a total of 2193 genes, 2806 SNPs/VNTRs,
4544 CNVs and 158 linkage regions. We provide the results as an
online resource for the broader autism research community at
http://autismkb.cbi.pku.edu.cn/ with extensive evidence and
annotations, supporting sophisticated browsing and searching
functionalities.

DATA COLLECTION 

Literature search 

We searched the PubMed database for publications related to
autism genetics, using the query terms 'autism AND associat*'
for association studies, 'autism AND (gene OR microarray OR
proteomics)' for expression profiling studies and the other
low-scale experimental studies, and 'autism AND (CNV OR copy
number variation OR microarray* OR microdel* OR microdup* OR
rearrange* OR (genome-wide AND (linkage OR associa* OR scan)))'
for CNV and linkage studies. The abstracts of the 4000+ articles
retrieved were reviewed to remove irrelevant papers, resulting
in a final set of 579 articles, reporting a total of 11 GWAS,
242 low-scale candidate gene association studies, 13 expression
profiling studies, 95 genome-wide CNV studies, 23 genome-wide
linkage analyses and 236 other low-scale experimental studies.
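
The three query terms above can be submitted programmatically. The following is a minimal sketch, assuming the NCBI E-utilities `esearch` endpoint is used; the paper does not say how the searches were run, and the dictionary keys, the `esearch_url` helper and the `retmax` default are illustrative names rather than part of AutismKB. The sketch only builds the request URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed access route, not stated in the paper).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query terms quoted from the text; the dictionary keys are hypothetical labels.
QUERIES = {
    "association": "autism AND associat*",
    "expression_and_low_scale": "autism AND (gene OR microarray OR proteomics)",
    "cnv_and_linkage": (
        "autism AND (CNV OR copy number variation OR microarray* OR microdel* "
        "OR microdup* OR rearrange* OR (genome-wide AND (linkage OR associa* OR scan)))"
    ),
}


def esearch_url(term, retmax=100):
    """Build a PubMed esearch URL for one query term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


urls = {name: esearch_url(term) for name, term in QUERIES.items()}
```

Each URL returns a list of PubMed IDs whose abstracts would still need the manual review described above.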

For syndromic autism-related genes, we first collected the
autism-related disorders and their causal genes from a recently
published comprehensive review (7). We then searched OMIM to get
the official disease names, linked all the disorders to OMIM,
and searched PubMed for additional citations using the query
'(OMIM disease name) AND autism' for each disease. All citations
were double-checked manually. Finally, 99 genes for 94
autism-related disorders supported by 250 references were
included in our data set of 'Syndromic Autism Related Genes'.

In total, we collected as many as 2135 non-syndromic 
autism-related genes, 99 syndromic autism-related genes, 
4544 CNVs and 158 linkage regions. The genes located in 
the CNV and linkage regions were then retrieved by the 
UCSC Genome Browser (15). 
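
Retrieving the genes that lie in a CNV or linkage region is an interval-overlap query. The sketch below illustrates the idea in plain Python; it is not the UCSC Genome Browser procedure itself, and it assumes closed intervals on a single coordinate system. The `Interval` type, the function names and all coordinates and gene names are made up for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def overlaps(a, b):
    """True if two closed intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def genes_in_region(region, genes):
    """Names of all genes whose interval overlaps the region, sorted alphabetically."""
    return sorted(name for name, iv in genes.items() if overlaps(region, iv))


# Placeholder coordinates, not real annotation.
region = Interval("chr2", 1000, 5000)
genes = {
    "GENE_A": Interval("chr2", 4000, 6000),  # partially inside the region
    "GENE_B": Interval("chr2", 5001, 7000),  # starts just past the region
    "GENE_C": Interval("chr3", 1000, 2000),  # different chromosome
}
hits = genes_in_region(region, genes)
```

Counting partial overlaps as hits is a design choice; a stricter variant would require the gene to lie entirely within the region.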

Evidence collection 

To establish the strength of evidence, we collected metadata
about each study and result. Supplementary Tables S1-S7 list the
evidence collected for each type of experimental method. In
summary, for each study of non-syndromic autism, we collected
the clinical and demographic features of the samples, including
ancestral background, country of origin, inclusion and exclusion
criteria, number of cases and controls with gender ratio, age at
examination and diagnosis criteria. We collected metadata about
the experimental design, including platform, experimental
methods, statistical methods and statistical significance.

For each gene, we estimated how much evidence supports its role
in autism by each type of experimental method and calculated a
weighted sum, following a multi-dimensional evidence-based
candidate gene prioritization approach (16). First, we assigned
initial scores to the genes for each type of experimental method
(Supplementary Table S8). A score of 0 is given if there is no
positive evidence of the type. Table 1 lists the distribution of
the scores for each type. Next, we used a benchmark data set
consisting of 21 non-syndromic autism-related genes considered
high confidence in six autism reviews (8,9,17-20) (Supplementary
Table S9) to calculate the weights. We followed a gene
prioritization approach (16) to generate a candidate weight
matrix pool consisting of $d^N = 7^6$ weight vectors, where $N$
represents the number of experimental methods and $d = N + 1$
represents the number of possible weights, 1-7, in the weight
vectors. A combined score for each gene was then calculated by
summing the products of the scores and corresponding weights
from the six experimental methods (16). All the 2135 candidate
genes, including the 21 benchmark genes, were sorted by their
combined scores. We selected the weight matrix that gave the
benchmark genes the highest rank as the optimal weight matrix
(Supplementary Table S10). About 95% of the benchmark genes were
ranked among the top 98% of all candidate genes. We chose the
lowest combined score in the benchmark data set, 9, as the
cutoff for high-confidence genes, resulting in a core data set
of 383 non-syndromic autism-related genes. Because the
definition of 'optimal weight matrix' is always debatable, we
provide an online ranking tool that allows users to re-rank the
genes interactively by entering customized weights based on
their own experience and preferences.
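
The weight search and cutoff described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "highest rank" is interpreted here as the lowest mean rank of the benchmark genes, ties are ranked by counting only strictly higher scores, and the function names and toy genes are hypothetical. With the defaults the pool has $7^6 = 117649$ weight vectors, as in the text; the toy call uses a smaller pool to keep the example fast.

```python
from itertools import product


def combined_score(scores, weights):
    """Weighted sum of one gene's per-method scores."""
    return sum(s * w for s, w in zip(scores, weights))


def mean_benchmark_rank(gene_scores, benchmark, weights):
    """Average rank of the benchmark genes (1 = best) under one weight vector."""
    totals = {g: combined_score(s, weights) for g, s in gene_scores.items()}
    # Assumed tie rule: rank = 1 + number of genes with a strictly higher total.
    ranks = [1 + sum(t > totals[b] for t in totals.values()) for b in benchmark]
    return sum(ranks) / len(ranks)


def best_weights(gene_scores, benchmark, n_methods=6, max_weight=7):
    """Exhaustively search all max_weight**n_methods weight vectors."""
    pool = product(range(1, max_weight + 1), repeat=n_methods)
    return min(pool, key=lambda w: mean_benchmark_rank(gene_scores, benchmark, w))


def core_set(gene_scores, benchmark, weights):
    """Genes scoring at least as high as the lowest-scoring benchmark gene."""
    totals = {g: combined_score(s, weights) for g, s in gene_scores.items()}
    cutoff = min(totals[b] for b in benchmark)
    return sorted(g for g, t in totals.items() if t >= cutoff)


# Toy data: three hypothetical methods and placeholder genes, not AutismKB data.
toy = {"GENE_A": (1, 0, 0), "GENE_B": (0, 2, 0), "GENE_C": (0, 0, 1)}
w = best_weights(toy, ["GENE_A"], n_methods=3, max_weight=3)
core = core_set(toy, ["GENE_A"], w)
```

Because many weight vectors can tie on the objective, `min` simply returns the first one encountered; a real analysis would need an explicit tie-breaking rule, which is one reason a user-adjustable ranking tool is useful.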



Table 1. Score distribution of genes discovered by each experimental method

Experimental methods                    Scores   Number of genes
Genome-wide association studies         1        81
                                        2        46
                                        3        5
Expression profiling                    1        1320
                                        2        285
                                        3        50
Genome-wide CNV studies                 1        1086
                                        2        34
                                        3        19
Linkage analyses                        1        535
                                        2        43
                                        3        0
Low-scale genetic association studies   1        128
                                        2        23
                                        3        12
Other low-scale experimental studies    1        241
                                        2        37
                                        3        30



D1018 Nucleic Acids Research, 2012, Vol. 40, Database issue 



For syndromic autism, we assigned four levels to the
autism-related disorders: Level 1 disorders have one reported
case with autistic symptoms, Level 2 have two to three cases in
a single family, Level 3 have cases in more than one family and
Level 4 are reported in multiple review papers (8). Causal genes
of Level 3 and 4 disorders were considered high-confidence genes
in the core data set.

Functional annotations 

To better understand the function of the genes associated with
autism, we collected extensive functional information and data,
including cross-links to NCBI Entrez Gene (21), OMIM (21),
UniProt (http://www.uniprot.org/) and Ensembl
(http://www.ensembl.org/), functional groups based on Gene
Ontology (http://www.geneontology.org/), protein-protein
interactions from BioGRID (22), BIND (23) and HPRD (24), and
genomic variants from the Database of Genomic Variants (DGV)
(25). We linked the genes to three psychiatric disease
databases, AlzGene (26), SzGene (27) and PDGene
(http://www.pdgene.org/), when a gene is shared between these
diseases and ASD. Information about homologues of the genes was
retrieved from Mouse Genome Informatics (MGI) (28), the
Zebrafish Model Organism Database (ZFIN) (29) and FlyBase (30).
We collected comprehensive mRNA expression profiling data,
including ESTs from NCBI UniGene Profiles (21), microarray
expression profiles from BioGPS (31) and the Allen Brain Atlas
(32), and RNA-Seq data (33-38). Protein expression evidence at
the peptide level was retrieved from PRIDE (39) and PeptideAtlas
(40). We also collected transcription factor binding sites in
the upstream regions of the genes from an in-house collection of
ChIP-Chip and ChIP-Seq data, miRNAs that may target the genes
from miRWalk (41) and TarBase (42), and natural antisense
transcripts that may regulate the genes from NATsDB (43).
Possible post-translational modifications were retrieved from
UniProt and dbPTM (44). We used KOBAS 2.0 (45) to retrieve the
pathways that the genes are involved in from BioCyc (46), KEGG
Pathway (47), PID (48), PID Reactome (48), PANTHER (49) and
Reactome (50), and possible associations with other diseases
from disease databases including KEGG Disease (51), FunDO
(52,53), GAD (54), NHGRI GWAS Catalog (55) and OMIM (21).
Pharmacogenetics and drug information was collected from the
Comparative Toxicogenomics Database (CTD) (56), the
Pharmacogenomics Knowledge Base (57) and DrugBank (58).
Supplementary Table S11 summarizes the gene coverage from each
source database. The overlap between the genes discovered by
expression profiling and those discovered by the other genetic
technologies is shown in Supplementary Table S12.

Enriched functional pathways were identified by KOBAS 2.0 (45)
and enriched GO terms were identified by DAVID (59). Pathways
such as neuroactive ligand-receptor interaction, synaptic
transmission and axon guidance were statistically significantly
enriched in the core data set (Table 2). In addition to synaptic
transmission, GO terms such as transmission of nerve impulse and
neuron differentiation were also found to be statistically
significant (Table 3). The result is consistent with recent
findings that synapse development, axon targeting and neuron
motility are related to autism etiology (60,61).
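
For readers unfamiliar with how P and Q values of this kind arise, the sketch below shows one common recipe for over-representation analysis: a one-sided hypergeometric test per pathway, followed by Benjamini-Hochberg correction. It is an assumption that this matches the settings used in KOBAS 2.0 or DAVID for the tables here; the source does not state them. The function names and all numbers in the example are illustrative.

```python
from math import comb


def hypergeom_pvalue(N, K, n, x):
    """P(X >= x) when n genes are drawn from N background genes, K of them in the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, k) * comb(N - K, n - k) for k in range(x, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Illustrative numbers only: all 5 selected genes fall in a 5-gene pathway out of 10.
p = hypergeom_pvalue(N=10, K=5, n=5, x=5)
q = benjamini_hochberg([0.01, 0.04, 0.03])
```

In a real analysis, N would be the number of annotated background genes, n the size of the core data set, and one p-value would be computed per pathway or GO term before correction.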



DATABASE INTERFACE 

We set up a MySQL relational database to store all the data.
A user-friendly web interface for browsing and searching was
implemented in PHP and JavaScript, using the jQuery framework.
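
The paper does not describe the database schema. Purely as an illustration of how genes, per-method scores and weights could be stored relationally and queried for the high-confidence set, the sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL. Every table name, column name and data row is hypothetical.

```python
import sqlite3

# Hypothetical schema; AutismKB's real tables are not documented in the text.
SCHEMA = """
CREATE TABLE gene (
    gene_id  INTEGER PRIMARY KEY,
    symbol   TEXT NOT NULL UNIQUE
);
CREATE TABLE method_weight (
    method  TEXT PRIMARY KEY,
    weight  INTEGER NOT NULL
);
CREATE TABLE evidence (
    gene_id  INTEGER NOT NULL REFERENCES gene(gene_id),
    method   TEXT NOT NULL REFERENCES method_weight(method),
    score    INTEGER NOT NULL
);
"""


def high_confidence_symbols(conn, cutoff=9):
    """Symbols of genes whose weighted evidence sum reaches the cutoff."""
    rows = conn.execute(
        """
        SELECT g.symbol
        FROM gene AS g
        JOIN evidence AS e ON e.gene_id = g.gene_id
        JOIN method_weight AS w ON w.method = e.method
        GROUP BY g.gene_id, g.symbol
        HAVING SUM(e.score * w.weight) >= ?
        ORDER BY SUM(e.score * w.weight) DESC, g.symbol
        """,
        (cutoff,),
    ).fetchall()
    return [symbol for (symbol,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder rows, not AutismKB data.
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "GENE_A"), (2, "GENE_B")])
conn.executemany("INSERT INTO method_weight VALUES (?, ?)", [("gwas", 5), ("cnv", 2)])
conn.executemany(
    "INSERT INTO evidence VALUES (?, ?, ?)",
    [(1, "gwas", 2), (1, "cnv", 1), (2, "cnv", 3)],
)
selected = high_confidence_symbols(conn)
```

Keeping the weights in their own table means a re-ranking with user-supplied weights only needs different values in `method_weight`, not changes to the evidence rows.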



Table 2. Top five enriched pathways of the genes in the high-confidence core data set, identified using KOBAS 2.0

Term                                      Database       ID            P value    Q value
Neuroactive ligand-receptor interaction   KEGG PATHWAY   hsa04080      1.03E-11   1.65E-09
Synaptic Transmission                     Reactome       REACT_13685   7.50E-10   9.06E-08
Axon guidance                             Reactome       REACT_18266   1.29E-08   1.24E-06
Calcium signaling pathway                 KEGG PATHWAY   hsa04020      2.25E-08   1.97E-06
Long-term potentiation                    KEGG PATHWAY   hsa04720      1.76E-07   9.98E-06



Table 3. Top 10 enriched GO terms of the genes in the high-confidence core data set

GO ID        GO term                                       P value    Q value
GO:0019226   transmission of nerve impulse                 5.44E-29   9.73E-26
GO:0007268   synaptic transmission                         4.59E-28   8.21E-25
GO:0045202   synapse                                       1.05E-23   1.45E-20
GO:0007610   behavior                                      4.53E-23   8.10E-20
GO:0044456   synapse part                                  7.21E-22   9.94E-19
GO:0044057   regulation of system process                  4.12E-21   7.38E-18
GO:0007267   cell-cell signaling                           4.17E-21   7.46E-18
GO:0030182   neuron differentiation                        8.21E-19   1.47E-15
GO:0031644   regulation of neurological system process     1.53E-18   2.74E-15
GO:0051969   regulation of transmission of nerve impulse   1.74E-18   3.11E-15






Figure 1. A typical gene entry in AutismKB. (A) Basic information and quick links, (B) nucleotide and protein sequences, (C) evidence statistics and
links to different data sources, (D) example of default collapsed data source, (E) link and example of polymorphism information and (F) example of
expanded data source with hidden information.
Browsing

Users can browse the data in AutismKB in a variety of ways,
including by data set, experimental method or chromosome. The
gene lists include a summary of information about the genes,
hyperlinked to detailed gene evidence and annotation pages.
Figure 1 shows a typical AutismKB gene entry. Basic information
such as gene symbol, gene name, cytoband and cross-links is
provided (Figure 1A). Nucleotide and protein sequences can be
sent to WebLab (62) for further analysis (Figure 1B). Summaries
of supporting evidence and category-specific scores are provided
(Figure 1C). Users can click on the hyperlink of a
category-specific score to view the corresponding category of
evidence. The categories without any evidence are hidden by
default (Figure 1D). Users can click on '+' to expand or '-' to
collapse different categories. Detailed information on
polymorphisms for low-scale association studies and GWAS can be
found by clicking on 'detail' in the tables (Figure 1E). When
exploring other low-scale studies and large-scale expression
studies, users can click the down arrow at the right of the
table to obtain more information (Figure 1F). Annotations of
each gene can be obtained by clicking the label 'view
annotation' at the top left.

CNVs are presented in a tabular view with name, cytoband, gain
or loss, number, evidence types and reference. Users can filter
the table by evidence type and chromosome (Figure 2A). Clicking
on the name of a CNV brings up its detailed information,
including the samples and methods of the study, the CNV region,
and any syndromic and non-syndromic autism genes in the region
(Figure 2B). Users can filter the linkage regions by chromosome
and click on a linkage name to view detailed information.

Searching 

AutismKB supports both text-based and sequence-based searches.
A quick search box at the top right of each page allows users to
search by gene symbol. An advanced search is provided to search
genes, CNVs and linkage regions by gene name, gene symbol, NCBI
Entrez ID, Ensembl ID, GO terms, UniProt ID, location, score,
method and PubMed ID. Finally, a BLAST search against the
nucleotide or protein sequences of all AutismKB genes is also
available.

Figure 2. CNV list and a typical CNV entry in AutismKB. (A) The CNV list in AutismKB and (B) a typical CNV entry.



CONCLUSION 

AutismKB is a comprehensive knowledgebase of autism-related
genes, CNVs and linkage regions with extensive evidence and
annotations. AutismKB will be updated periodically. We hope that
it can be a valuable resource for the autism research community.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1-12. 